feat: dask namespace `concat` method #840

benrutter · 2024-08-21T13:58:20Z

What type of PR is this? (check all applicable)

Related issues

Related issue feat: add methods to Dask backend #637

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below.

I've added in a concat method returning a DaskLazyFrame. One thing that wasn't tested for the concat_test was whether an error gets thrown trying to horizontally concat frames of different length. I'm not 100% sure what happens within polars/pandas when you do this, but dask will throw a confusing assertion error when you try to access anything within the dataframe, so I thought putting in a handy check here was worthwhile.

FBruzzesi

Hey thanks for the PR, that was fast :)

I left a couple of comments, the main concern is about triggering computation on behalf of the user, which we would like to avoid as much as possible.

I would also double check the behaviour of the default join="outer" argument for dd.concat.

Finally, regarding the missing test, feel free to add it to get higher/better coverage of edge cases.

narwhals/_dask/namespace.py

benrutter · 2024-08-21T15:02:39Z

Cheers @FBruzzesi!

Think you're right around the behaviour of "outer" by default. In theory, I think it only comes into play on "vertical" merges (otherwise, mismatching indexes throw errors) where columns are checked to be matching already, but that sounds a little like an error waiting to happen.

Just looked at adding another test, but looks like its totally fine to horizontally concatenate different length frames outside of dask. For now, I've taken out the check for matching lengths- it'll avoid any computing for figuring out length, and pass over handling the error to dask, which is probably preferable? (let me know if I'm missing a trick on how to check concats are likely to be valid though)

benrutter · 2024-08-21T15:11:34Z

Also, not very related, but quick question on tests! Looks like ubuntu tests are failing at the moment because they require 100% coverage and there's something like 99.7% (macos requires 90% and windows 95%).

Is it intentionally that they're all different? If it's be handy I can bundle updating to make them all match in with this PR.

FBruzzesi · 2024-08-21T16:24:59Z

Hey sorry I am from mobile, so this could be a sloppy comment. I will try to do my best.

Think you're right around the behaviour of "outer" by default. In theory, I think it only comes into play on "vertical" merges (otherwise, mismatching indexes throw errors) where columns are checked to be matching already, but that sounds a little like an error waiting to happen.

Alright, to me the join keyword resonates more on a horizontal concat but apparently not. I am happy to keep the check on the column names.

Just looked at adding another test, but looks like its totally fine to horizontally concatenate different length frames outside of dask. For now, I've taken out the check for matching lengths- it'll avoid any computing for figuring out length, and pass over handling the error to dask, which is probably preferable? (let me know if I'm missing a trick on how to check concats are likely to be valid though)

Maybe we can just let dask run its behavior without manually checking for dataframe lengths. This is what we are doing for pandas, yet I would try to see how Polars LazyFrame behaves (i.e. if it throws an error, and if it does it most likely happens only at collect time).

benrutter · 2024-08-22T08:27:09Z

Thanks @FBruzzesi, comment is really helpful! At the moment, the changes I've made since my initial commit (have squashed the commits together for neatness):

keyword argument of "inner" now passed into dd.concat - I think this is the desired behaviour?
I've taken out length checks to avoid unnecessary compute and just let dask do its own thing with checks
Extra tweak: I also swapped the import dask.dataframe as dd # ignore-banned-import for dd = get_dask_dataframe() (I think that's better practice maybe?)

On the join="inner" vs join="outer" behaviour, where you thinking around any specific cases? I can add some extra tests in if there's an example that you have in mind.

FBruzzesi · 2024-08-22T09:43:09Z

keyword argument of "inner" now passed into dd.concat - I think this is the desired behaviour?

Thanks! I am still a bit unsure about the difference. I can run some tests later today or most likely tomorrow, unless you have some examples ready to show the difference in behavior.

I've taken out length checks to avoid unnecessary compute and just let dask do its own thing with checks

I can still see the check using len?!

Extra tweak: I also swapped the import dask.dataframe as dd # ignore-banned-import for dd = get_dask_dataframe() (I think that's better practice maybe?)

We recently changed all those (check #788), so I would suggest to keep it as before:

- dd = get_dask_dataframe()
+ import dask.dataframe as dd  # ignore-banned-import

On the join="inner" vs join="outer" behaviour, where you thinking around any specific cases? I can add some extra tests in if there's an example that you have in mind.

Back to the first point

benrutter · 2024-08-22T10:56:57Z

Well that's embarrassing! Looks like my "neat squashing" was just a "neat deleting a commit" for the validation check removal 🫠 All taken out now.

Ok, I've done some playing around and actually looks like dask can concatenate on axis 1 with different lengths anyway (I don't know if this works with expression based columns though, definitely was getting an assertion error previously). That's pretty handy because it makes the outer/inner behaviour really clear for both!

I'll explain them here, and hopefully you can give me a steer on the desired Narwhals behaviour? (I think I know but not 100% sure)

import dask.dataframe

left = dd.from_dict({"a": [1, 2], "b": [1, 2]}, npartitions=1)
right = dd.from_dict({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]}, npartitions=1)

# inner on axis 0 (vertical)
dd.concat([right, left], axis=0, join="inner").compute()
#   a b
# 0 1 1
# 1 2 2
# 2 3 3
# 0 1 1
# 1 2 2

# outer on axis 0 (vertical)
dd.concat([right, left], axis=0, join="outer").compute()
#   a b c
# 0 1 1 1.0
# 1 2 2 2.0
# 2 3 3 3.0
# 0 1 1 NaN
# 1 2 2 NaN


# inner on axis 1 (horizontal)
dd.concat([right, left], axis=1, join="inner").compute()
#   a b c a b
# 0 1 1 1 1 1
# 1 2 2 2 2 2

# outer on axis 1 (horizontal)
dd.concat([right, left], axis=1, join="outer").compute()
#   a b c a   b
# 0 1 1 1 1.0 1.0
# 1 2 2 2 2.0 2.0
# 2 3 3 3 NaN Nan

Hopefully that makes it a little clearer? Think I muddied the water a little by being confused before on the behaviour for axis=1, sorry!

Looking at the other examples in narwhals I think that when axis=0 (vertical concat) we'd want join=inner and when 'axis=1' (horizontal concat) we'd want join=outer?

I've implemented the above behaviour for now - would appreciate feedback on if that's logical though! 🙏

FBruzzesi · 2024-08-22T14:40:48Z

Awesome, that makes it much more clear and saved me a bunch of time, thanks a lot.

I am checking polars LazyFrame behavior, which is what we should aim to replicate:

Edit:

TL;DR

how="vertical" behavior should be easy to replicate
how="horizontal" we could end up with a stricter behavior for dask than Polars LazyFrame, but I think that's still ok.

`how="vertical"`

import polars as pl

data1 = {"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]}
data2 = {"a": [1, 2], "b": [1, 2]}

df1 = pl.LazyFrame(data1)
df2 = pl.LazyFrame(data2)

df = pl.concat([df1, df2], how="vertical")  # this doesn't raise just yet, but then when we collect
df.collect()

ShapeError: unable to append to a DataFrame of width 3 with a DataFrame of width 2

So I think we can raise early if columns are not same for vertical, exactly how you are already doing. Once the check is done, join keyword does not matter anymore since the columns will already be the same.

Edit: Actually the column order also matter in polars, i.e. the following fails:

data1 = {"a": [1, 2, 3], "b": [1, 2, 3]}
data2 = {"b": [1, 2], "a": [1, 2]}

df1 = pl.LazyFrame(data1)
df2 = pl.LazyFrame(data2)

df = pl.concat([df1, df2], how="vertical").collect()

ShapeError: unable to vstack, column names don't match: "a" and "b"

`how="horizontal"`

First check is done on the duplicate column names, before collect time:

data1 = {"a": [1, 2, 3], "b": [1, 2, 3]}
data2 = {"b": [1, 2], "a": [1, 2]}

df1 = pl.LazyFrame(data1)
df2 = pl.LazyFrame(data2)

df = pl.concat([df1, df2], how="horizontal")

DuplicateError: Column with name 'a' has more than one occurrence

This such check should be fairly easy to achieve in dask as well.

If the column names are ok, then, actually the different length does not matter:

data1 = {"a": [1, 2, 3], "b": [1, 2, 3]}
data2 = {"c": [1, 200], "d": [1, 2]}

df1 = pl.LazyFrame(data1)
df2 = pl.LazyFrame(data2)

df = pl.concat([df1, df2], how="horizontal")
df.collect()

shape: (3, 4)
┌─────┬─────┬──────┬──────┐
│ a   ┆ b   ┆ c    ┆ d    │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ 1   ┆ 1    ┆ 1    │
│ 2   ┆ 2   ┆ 200  ┆ 2    │
│ 3   ┆ 3   ┆ null ┆ null │
└─────┴─────┴──────┴──────┘

This behavior I was not able to replicate in dask, meaning that at compute time if the length is different I am getting an AssertionError with whatever parameter I am trying to play with.

I think this is ok as we can be more restrictive than polars, but I would like to end up with the same result (cc: @MarcoGorelli )

Hope this helps

Final edit: I was not able to fully replicate what you have done @benrutter (I am working with dask '2024.6.2')

Update narwhals/_dask/namespace.py Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com> import change inner kwarg doh! dynamic join dumb typo validation

benrutter · 2024-08-23T13:18:15Z

Thanks @FBruzzesi that's very helpful! I've made a couple more changes off the basis of that:

Changed the validation to check for order as well as column names matching
Added validation for horizontal concat to ensure column names aren't duplicated

I've tested adding in some new tests, which I this would be ideal, here's the tests I added:

# extra condition for vertical concat to check that column order is required
with pytest.raises((Exception, TypeError)):
    reversed_data_right = dict(reversed(data_right.items()))
    reversed_df_right = nw.from_native(constructor(reversed_data_right)).lazy()
    nw.concat([df_left, reversed_df_right], how="vertical").collect()

Only that actually doesn't throw an error for most of them. And then the same again with:

# extra condition to check that duplicate columns are banned
    with pytest.raises(Exception, match="occurence"):
        duplicate_columns = nw.from_native(constructor(data_right | data))
        nw.concat([df_left, df_right], how="horizontal")

Which also doesn't throw an error. I'm down to try and address this for other dataframe types if Narwhals want to be strict on those conditions, but I guess it'd probably be better as a seperate issue/pr, so I've just left out those tests for now.

Edit: I think you found the reason I was seeing inconsistent behaviour with dask! Initially I was trying on the latest version, but the examples I shared where from an installation of 2024.2 before expressions was opt-out.

FBruzzesi

Thanks for adjusting on all fronts and already planning some further tests.

I commented on some error raise since:

in other dataframe we raise AssertionError for empty list of items
other errors are certainly not type related, so we can keep AssertionError for now.

I think we can merge after that :)

narwhals/_dask/namespace.py

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

benrutter · 2024-08-23T14:54:36Z

Awesome thanks @FBruzzesi - I've made those changes (as an FYI I also raised an issue with over on dask/dask to check around the expected behaviour for concatenating on axis=1 dask/dask#11343)

narwhals/_dask/namespace.py

Co-authored-by: Aidos Kanapyanov <65722512+aidoskanapyanov@users.noreply.github.com>

for more information, see https://pre-commit.ci

FBruzzesi

Logic is 100% in place, there are a couple of issues with coverage for:

the new checks we discussed
the variables introduced for mypy type hints

I commented on how to deal with these cases - as soon as all CI is green we are ready to merge

narwhals/_dask/namespace.py

FBruzzesi

Ready to ship it? 🚀 Thanks @benrutter

MarcoGorelli · 2024-08-28T09:15:02Z

nice one, thanks all! 🙌

github-actions bot added enhancement New feature or request internal labels Aug 21, 2024

FBruzzesi reviewed Aug 21, 2024

View reviewed changes

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

benrutter force-pushed the main branch 2 times, most recently from 827c90f to 436c323 Compare August 21, 2024 15:00

benrutter force-pushed the main branch 3 times, most recently from a8560c4 to 9700bf9 Compare August 22, 2024 10:56

DeaMariaLeon removed the internal label Aug 22, 2024

concat support

a03d45b

Update narwhals/_dask/namespace.py Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com> import change inner kwarg doh! dynamic join dumb typo validation

benrutter force-pushed the main branch from 9700bf9 to a03d45b Compare August 23, 2024 13:09

Merge branch 'main' into main

d44a874

FBruzzesi reviewed Aug 23, 2024

View reviewed changes

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

benrutter and others added 4 commits August 23, 2024 15:48

ValueError raise changed to AssertionError

76b6412

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

TypeError changes to AssertionError

836a9fe

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

TypeError changed to AsssertionError

2a3311d

Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>

pragma no cover

3925b7f

aidoskanapyanov reviewed Aug 24, 2024

View reviewed changes

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

benrutter and others added 2 commits August 25, 2024 13:38

List comprehension over chain

2ddcccf

Co-authored-by: Aidos Kanapyanov <65722512+aidoskanapyanov@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

99cb1e9

for more information, see https://pre-commit.ci

FBruzzesi reviewed Aug 25, 2024

View reviewed changes

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved

benrutter and others added 3 commits August 27, 2024 16:55

refactor for coverage

ac0e99c

Merge branch 'main' into main

c3395e3

pragma no cover on message, removed type checking import already in main

58123b3

FBruzzesi approved these changes Aug 28, 2024

View reviewed changes

FBruzzesi merged commit a0af8ee into narwhals-dev:main Aug 28, 2024
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: dask namespace `concat` method #840

feat: dask namespace `concat` method #840

benrutter commented Aug 21, 2024

FBruzzesi left a comment

benrutter commented Aug 21, 2024

benrutter commented Aug 21, 2024

FBruzzesi commented Aug 21, 2024

benrutter commented Aug 22, 2024

FBruzzesi commented Aug 22, 2024 •

edited

Loading

benrutter commented Aug 22, 2024 •

edited

Loading

FBruzzesi commented Aug 22, 2024 •

edited

Loading

benrutter commented Aug 23, 2024 •

edited

Loading

FBruzzesi left a comment

benrutter commented Aug 23, 2024

FBruzzesi left a comment

FBruzzesi left a comment •

edited

Loading

MarcoGorelli commented Aug 28, 2024

feat: dask namespace concat method #840

feat: dask namespace concat method #840

Conversation

benrutter commented Aug 21, 2024

What type of PR is this? (check all applicable)

Related issues

Checklist

If you have comments or can explain your changes, please do so below.

FBruzzesi left a comment

Choose a reason for hiding this comment

benrutter commented Aug 21, 2024

benrutter commented Aug 21, 2024

FBruzzesi commented Aug 21, 2024

benrutter commented Aug 22, 2024

FBruzzesi commented Aug 22, 2024 • edited Loading

benrutter commented Aug 22, 2024 • edited Loading

FBruzzesi commented Aug 22, 2024 • edited Loading

TL;DR

how="vertical"

how="horizontal"

benrutter commented Aug 23, 2024 • edited Loading

FBruzzesi left a comment

Choose a reason for hiding this comment

benrutter commented Aug 23, 2024

FBruzzesi left a comment

Choose a reason for hiding this comment

FBruzzesi left a comment • edited Loading

Choose a reason for hiding this comment

MarcoGorelli commented Aug 28, 2024

feat: dask namespace `concat` method #840

feat: dask namespace `concat` method #840

FBruzzesi commented Aug 22, 2024 •

edited

Loading

benrutter commented Aug 22, 2024 •

edited

Loading

FBruzzesi commented Aug 22, 2024 •

edited

Loading

`how="vertical"`

`how="horizontal"`

benrutter commented Aug 23, 2024 •

edited

Loading

FBruzzesi left a comment •

edited

Loading